Clustering Ensemble for Spam Filtering
Abstract
One of the main problems that modern e-mail systems face is managing the high volume of spam or junk mail they receive. These systems are expected to distinguish between legitimate mail and spam in order to present the end user with as much relevant information as possible. This study presents a novel hybrid intelligent system using both unsupervised and supervised learning that can be easily adapted for use in an individual or collaborative system. The system divides the spam filtering problem into two stages. First, it divides the input data space into different parts, each grouping similar messages. Then it generates several simple classifiers, each used to correctly classify the messages contained in one of the previously determined parts. In this way the efficiency of each classifier increases, as each can specialize in separating spam from certain types of related messages. The hybrid system has been tested with a real e-mail database, and a comparison of its results with those obtained from other common classification methods is also included. This novel hybrid technique proves to be effective for the problem under study.
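The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the classifiers used, so everything here is an assumption made for illustration: plain k-means with a deterministic farthest-point initialization stands in for the unsupervised stage, a nearest-class-mean rule stands in for each "simple classifier", and the two numeric features per message are hypothetical toy values.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def nearest(centroids, p):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], p))

def kmeans(points, k, iters=20):
    """Plain k-means (assumed stage 1), seeded by farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

class NearestMeanClassifier:
    """Assumed 'simple classifier': predicts the label of the closest class mean."""
    def fit(self, X, y):
        self.means = {lab: mean([x for x, l in zip(X, y) if l == lab]) for lab in set(y)}
        return self

    def predict(self, x):
        return min(self.means, key=lambda lab: sq_dist(self.means[lab], x))

class ClusteredEnsemble:
    """Stage 1: partition the input space. Stage 2: one classifier per part."""
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.centroids = kmeans(X, self.k)
        parts = {}
        for x, label in zip(X, y):
            xs, ys = parts.setdefault(nearest(self.centroids, x), ([], []))
            xs.append(x)
            ys.append(label)
        self.experts = {c: NearestMeanClassifier().fit(xs, ys) for c, (xs, ys) in parts.items()}
        self.default = Counter(y).most_common(1)[0][0]  # fallback for empty parts
        return self

    def predict(self, x):
        expert = self.experts.get(nearest(self.centroids, x))
        return expert.predict(x) if expert else self.default

# Hypothetical toy data: two groups of messages in which the second feature
# indicates spam in one group and legitimate mail in the other.
X = [[1, 1], [2, 1], [1, 2], [1, 5], [2, 6], [1, 6],
     [10, 5], [11, 6], [10, 6], [10, 1], [11, 1], [10, 2]]
y = ["ham", "ham", "ham", "spam", "spam", "spam",
     "ham", "ham", "ham", "spam", "spam", "spam"]

model = ClusteredEnsemble(k=2).fit(X, y)
print(model.predict([2, 5]), model.predict([11, 5]))
```

In this toy data the overall means of the two classes coincide, so a single nearest-mean classifier trained on all messages could not separate them, while each per-part classifier only has to handle one group of related messages. This is meant only to illustrate the specialization argument, not to reproduce the system evaluated in the paper.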
Similar resources
Ensemble Classification for Spam Filtering Based on Clustering of Text Corpora
Spam filtering has become a very important issue throughout the last years as unsolicited bulk e-mail imposes large problems in terms of both the amount of time spent on and the resources needed to automatically filter those messages. Text information retrieval offers the tools and algorithms to handle text documents in their abstract vector form. Thereon, machine learning algorithms can be app...
Clustering Based Ensemble Classification for Spam Filtering
Spam filtering has become a very important issue throughout the last years as unsolicited bulk e-mail imposes large problems in terms of both the amount of time spent on and the resources needed to automatically filter those messages. Text information retrieval offers the tools and algorithms to handle text documents in their abstract vector form. Thereon, machine learning algorithms can be app...
Single-Class Learning for Spam Filtering: An Ensemble Approach
Spam, also known as Unsolicited Commercial Email (UCE), has been an increasingly annoying problem for individuals and organizations. Most prior research formulated spam filtering as a classical text categorization task, in which training examples must include both spam emails (positive examples) and legitimate mails (negatives). However, in many spam filtering scenarios, obtaining legitimate ...
SpamCooling: A Parallel Heterogeneous Ensemble Spam Filtering System Based on Active Learning Techniques
Anti-spam technology has developed rapidly in recent years. With the emerging applications of machine learning in diverse fields, researchers and manufacturers around the world have tried a large number of related algorithms to prevent spam. In this paper, we designed an effective anti-spam protection system, SpamCooling, based on the mechanism of active learning and parallel heterog...
A Comparison of Ensemble and Case-Base Maintenance Techniques for Handling Concept Drift in Spam Filtering
The problem of concept drift has recently received considerable attention in machine learning research. One important practical problem where concept drift needs to be addressed is spam filtering. The literature on concept drift shows that among the most promising approaches are ensembles and a variety of techniques for ensemble construction has been proposed. In this paper we consider an alter...
Feature Weight Optimization Mechanism for Email Spam Detection based on Two-Step Clustering Algorithm and Logistic Regression Method
This research proposed an improved spam filtering technique for suspected emails and messages, based on feature weighting and a combination of two-step clustering and logistic regression. Unique, important features are used as the optimum input for the proposed hybrid approach. This study adopted a spam detector model based on a distance measure and a threshold value. The aim of this model was t...